Abstract

A discrete auditory transform (DAT) from sound signal to spectrum
is presented and shown to be invertible in closed form. The
transform preserves energy, and its spectrum is smoother than that of
the discrete Fourier transform (DFT), consistent with human audition.
DAT and DFT are compared in signal denoising tests using a spectral
thresholding method. The signals are noisy speech segments. It is
found that DAT can gain 3 to 5 decibels (dB) in signal-to-noise ratio
(SNR) over DFT, except when the noise level is relatively low.

1 Introduction



Audible acoustic signal processing often consists of a frame-by-frame discrete
Fourier transform (DFT) of the input signal, followed by spreading of the Fourier
spectrum using the critical band filter; see M. Schroeder et al [3]. These two
steps mimic the responses of human audition to the input signal, facilitating
the computation of the excitation pattern over critical bands and
psychoacoustics-based processing. However, the spreading operation in step two
is not invertible.

In this work, an invertible discrete auditory transform (DAT) is formulated
to combine the two steps into one. DAT incorporates the spectral
spreading functions of Schroeder et al [3], which leads to a smoother spectrum
than that of the discrete Fourier transform (DFT). As a result, DAT has
better localization properties in the time domain. DAT bears some resemblance
to the wavelet transform [2] in that a function of one variable (time)
is transformed into a function of two variables (time and frequency). It is
this redundancy that makes the inversion possible and explicit.

The paper is organized as follows. In section 2, a general form of DAT is
introduced and its inversion established. In section 3, a specific DAT is given
based on the known auditory spectral energy spreading functions of Schroeder
et al [3]. The DAT spectrum is defined and compared with the DFT spectrum. The time
localization property of DAT basis functions is illustrated as well. In section
4, noisy signal reconstruction from the thresholded spectrum is carried out and
its signal-to-noise ratio is computed from the reconstructed signal and the
clean (noise-free) signal. The signal is a segment of male or female speech,
and can be either voiced (e.g. vowels) or unvoiced (consonants such as s and
f). DAT is found to gain 3 to 5 decibels (dB) in signal-to-noise ratio (SNR)
over DFT in such a denoising task. Concluding remarks are made in section 5.



2 Discrete Auditory Transform 

Let $s = (s_0, \cdots, s_{N-1})$ be a discrete signal, and $\hat{s}$ its discrete
Fourier transform (DFT):
The DFT inversion formula is:
For two signals s and t of length N, the Plancherel-Parseval equality is:
implying the energy identity:
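
For reference, the four DFT relations (2.1) to (2.4) can be written as below. The normalization is an assumption here (unnormalized forward transform, factor $1/N$ on the inverse), chosen because it is consistent with the energy preservation and the inversion formula (2.11) stated later in the paper; $\hat{t}$ denotes the DFT of $t$ and $^*$ denotes complex conjugation.

```latex
\begin{align}
\hat{s}_k &= \sum_{n=0}^{N-1} s_n \, e^{-i 2\pi nk/N}, \quad 0 \le k \le N-1, \tag{2.1}\\
s_n &= \frac{1}{N} \sum_{k=0}^{N-1} \hat{s}_k \, e^{i 2\pi nk/N}, \quad 0 \le n \le N-1, \tag{2.2}\\
\sum_{n=0}^{N-1} s_n \, t_n^* &= \frac{1}{N} \sum_{k=0}^{N-1} \hat{s}_k \, \hat{t}_k^*, \tag{2.3}\\
\sum_{n=0}^{N-1} |s_n|^2 &= \frac{1}{N} \sum_{k=0}^{N-1} |\hat{s}_k|^2. \tag{2.4}
\end{align}
```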
Define the discrete auditory transform (DAT):

$$S_{j,m} = \sum_{l=0}^{N-1} s_l \, K_{j-l,m}, \tag{2.5}$$

where $K_{j,m}$ is the double indexed discrete kernel function (2.6), defined
through a matrix $X_{m,n}$ that has square sum equal to one in $m$:

$$\sum_{m=0}^{M-1} |X_{m,n}|^2 = 1, \quad \forall n. \tag{2.7}$$
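
A kernel consistent with the convolution form (2.5), the normalization (2.7) and the inversion formula (2.11) is the inverse DFT of $X_{m,n}$ in $n$. The display below is a reconstruction under that reading, assuming a DFT convention with an unnormalized forward transform and a factor $1/N$ on the inverse; the index $j - l$ in (2.5) is then understood modulo $N$.

```latex
K_{j,m} = \frac{1}{N} \sum_{n=0}^{N-1} X_{m,n} \, e^{i 2\pi jn/N},
\qquad 0 \le j \le N-1, \quad 0 \le m \le M-1. \tag{2.6}
```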
2.1 Energy Identity

Let us show the energy conservation property of the transform. Upon
substituting (2.6) into (2.5), the transform can be written as:

which is similar to the representation of time domain solutions of cochlear
models as a sum of time harmonic solutions.
It follows from (2.8) and the identities above that:

implying:

Polarizing with (2.9), one finds the analogous Plancherel-Parseval identity of
DAT for two signals s and t:
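
Under an assumed DFT convention (unnormalized forward transform, factor $1/N$ on the inverse) and a kernel $K_{j,m}$ equal to the inverse DFT of $X_{m,n}$ in $n$, the steps of this subsection can be sketched as follows. The placement of the equation numbers follows the references in the text and is itself an assumption; $T_{j,m}$, the DAT of $t$, is a symbol introduced here for illustration.

```latex
\begin{align}
S_{j,m} &= \frac{1}{N} \sum_{n=0}^{N-1} \hat{s}_n \, X_{m,n} \, e^{i 2\pi jn/N}, \tag{2.8}\\
\sum_{j=0}^{N-1} |S_{j,m}|^2 &= \frac{1}{N} \sum_{n=0}^{N-1} |\hat{s}_n|^2 \, |X_{m,n}|^2, \\
\sum_{m=0}^{M-1} \sum_{j=0}^{N-1} |S_{j,m}|^2
  &= \frac{1}{N} \sum_{n=0}^{N-1} |\hat{s}_n|^2 = \sum_{n=0}^{N-1} |s_n|^2, \tag{2.9}\\
\sum_{m=0}^{M-1} \sum_{j=0}^{N-1} S_{j,m} \, T_{j,m}^* &= \sum_{n=0}^{N-1} s_n \, t_n^*. \tag{2.10}
\end{align}
```

The second line applies the DFT energy identity to the sequence $\hat{s}_n X_{m,n}$ for fixed $m$, and the third sums over $m$ using the unit square sum of $X_{m,n}$.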
2.2 Inversion

The explicit inversion formula is:

$$s_j = \frac{1}{N} \sum_{m=0}^{M-1} \sum_{l=0}^{N-1} S_{l,m} \sum_{n=0}^{N-1} X^*_{m,n} \, e^{i 2\pi (j-l) n/N}. \tag{2.11}$$

Proof: Consider the sum in $l$. In view of (2.8), we see that

So the right hand side of (2.11) is equal to:

which equals, upon summing over $m$ and using (2.7):
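
The three displays of the proof can be sketched as follows. This is a reconstruction, assuming a DFT convention with an unnormalized forward transform and a factor $1/N$ on the inverse, and reading (2.8) as $S_{l,m} = \frac{1}{N} \sum_{n'} \hat{s}_{n'} X_{m,n'} e^{i 2\pi l n'/N}$; it is consistent with (2.5), (2.7) and (2.11).

```latex
\begin{align}
\sum_{l=0}^{N-1} S_{l,m} \, e^{-i 2\pi ln/N} &= \hat{s}_n \, X_{m,n}, \\
\frac{1}{N} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X^*_{m,n} \, e^{i 2\pi jn/N}
   \sum_{l=0}^{N-1} S_{l,m} \, e^{-i 2\pi ln/N}
  &= \frac{1}{N} \sum_{n=0}^{N-1} \hat{s}_n \, e^{i 2\pi jn/N} \sum_{m=0}^{M-1} |X_{m,n}|^2, \\
\frac{1}{N} \sum_{n=0}^{N-1} \hat{s}_n \, e^{i 2\pi jn/N} &= s_j.
\end{align}
```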
3 Transform Kernel and Spectrum



The role of the transform kernel $X_{m,n}$ is to spread the DFT vector $\hat{s}_n$. Here
our knowledge of human audition will be utilized. Let us adopt the real
nonnegative energy spreading function of Schroeder et al [3], denoted by
$S_{m,n} = S(b(f_m), b(f_n))$, where $f_m$ is the frequency to spread from, $f_n$ is the
frequency to spread to, and $b$ is the standard mapping from Hertz (Hz) to
Bark scale. The functional form of $S(\cdot, \cdot)$ is given in Schroeder et al [3].

The DFT of a real vector $s$ satisfies the symmetry property $\hat{s}_k = \hat{s}^*_{N-k}$,
$k = 1, 2, \cdots, N-1$. It is natural for the spreading kernel $X_{m,n}$ to respect this
symmetry. Suppose the discrete signal $s$ has sampling frequency $F_s$ (Hz).
The DFT component $\hat{s}_n$ ($0 \le n \le N/2$, $N$ a power of 2) corresponds to
frequency:

$$f_n = F_s \cdot n / N, \quad n \le N/2. \tag{3.1}$$

Let $U_{m,n} = U(f_m, f_n) = S^{1/2}(b(f_m), b(f_n))$, $0 \le m \le M-1$, $M = N/2$. The
square root is to convert spreading from energy to amplitude scale. Then
normalize $U_{m,n}$ to define $X_{m,n}$ as follows:

$$X_{m,n} = \frac{U(f_m, f_n)}{m_f(f_n)}, \quad 0 \le n \le N/2,$$

$$X_{m,n} = \frac{U(f_m, f_{N-n})}{m_f(f_{N-n})}, \quad N/2 < n \le N-1,$$

where the $m_f$ function is:

$$m_f(f) = \left( \sum_{m=0}^{M-1} |U(f_m, f)|^2 \right)^{1/2}. \tag{3.2}$$

We see that $X_{m,n}$ is symmetric in $n$ with respect to $N/2$, and periodic in
$n$. The normalization property (square sum equal to one) with respect to
$m$ holds. See Figure 1 for a plot of $X_{m,n}$ in $n$, at $m = 5 : 10 : 55$, where
$N = 128$, $F_s = 16000$ Hz.
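
The kernel construction above can be sketched in a few lines of Python. The paper defers the functional form of $S$ and the Bark mapping $b$ to Schroeder et al [3]; the two formulas used below (the mapping $f = 650 \sinh(b/7)$ and the spreading level $15.81 + 7.5x - 17.5\sqrt{1 + x^2}$ dB with $x$ the Bark distance plus 0.474) are the commonly quoted forms and are assumptions here, as is the sign convention for the Bark distance. The function names are hypothetical.

```python
import math

def hz_to_bark(f):
    # Assumed Bark mapping: f = 650 * sinh(b / 7), inverted for b.
    return 7.0 * math.asinh(f / 650.0)

def spreading_energy(b_from, b_to):
    # Assumed Schroeder-type spreading function, returned in energy units.
    x = (b_to - b_from) + 0.474
    level_db = 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)
    return 10.0 ** (level_db / 10.0)

def build_kernel(N=128, fs=16000.0):
    """Return X[m][n] for 0 <= m < N/2 and 0 <= n < N, following (3.1)-(3.2)."""
    M = N // 2
    freq = [fs * n / N for n in range(N // 2 + 1)]      # f_n, equation (3.1)
    bark = [hz_to_bark(f) for f in freq]
    X = [[0.0] * N for _ in range(M)]
    for n in range(N):
        k = n if n <= N // 2 else N - n                 # mirror about N/2
        # Amplitude spreading U(f_m, f_k): square root of the energy spreading.
        col = [math.sqrt(spreading_energy(bark[m], bark[k])) for m in range(M)]
        m_f = math.sqrt(sum(u * u for u in col))        # equation (3.2)
        for m in range(M):
            X[m][n] = col[m] / m_f
    return X

X = build_kernel()
# Largest deviation of the column square sums from one, i.e. a check of (2.7).
worst = max(abs(sum(X[m][n] ** 2 for m in range(len(X))) - 1.0)
            for n in range(len(X[0])))
```

Because every column is divided by its own norm, the unit square sum in $m$ holds for any positive spreading function, so the check does not depend on the assumed formulas being the ones used for Figure 1.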
3.1 DAT Spectrum



In view of formula (2.8), we extract the amplitude (intensity) of the $S_{j,m}$
($j \in [0, N-1]$) for each $m$ and define the DAT spectrum:



As the input sound signal is divided into frames of length N, DAT spectrum 
can vary from frame to frame in time. 

3.2 Transform Properties 

Let us illustrate the DAT properties by considering a 500 Hz square wave 
(top panel of Figure 2). Middle and bottom panels of Figure 2 show DFT 
and DAT spectra of the 500 Hz square wave in the first frame, for N = 128, 
and M = N/2 = 64. The later frames are similar. Compared with DFT 
spectrum, DAT spectrum is smoother, especially towards higher frequencies. 

In the time domain representation (2.11), $N$ times the inverse DFT of
$X^*_{m,n}$ in $n$ for each $m$ plays the role of basis functions. Figure 3 shows such
functions for $m = 20$ and $m = 40$; their time domain localization property
reflects the smoother DAT spectrum.
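
The energy and inversion properties can also be checked numerically on a 500 Hz square wave frame. The sketch below is illustrative only: it assumes an unnormalized forward DFT with a factor $1/N$ on the inverse, and it uses a hypothetical stand-in kernel (exponential decay in bin distance, mirrored about $N/2$ and normalized in $m$) rather than the auditory kernel of this section, since any kernel with unit square sum in $m$ should behave the same way. The array `S[m][j]` stores $S_{j,m}$.

```python
import cmath
import math

def dft(x, sign=-1):
    """Unnormalized DFT sum with exponent sign * i 2 pi n k / N."""
    N = len(x)
    w = [cmath.exp(sign * 2j * math.pi * k / N) for k in range(N)]
    return [sum(x[n] * w[(n * k) % N] for n in range(N)) for k in range(N)]

def standin_kernel(N):
    """Hypothetical smooth kernel X[m][n] with sum over m of X[m][n]^2 = 1."""
    M = N // 2
    X = [[0.0] * N for _ in range(M)]
    for n in range(N):
        k = n if n <= N // 2 else N - n              # mirror about N/2
        col = [math.exp(-abs(m - k) / 2.0) for m in range(M)]
        norm = math.sqrt(sum(u * u for u in col))
        for m in range(M):
            X[m][n] = col[m] / norm
    return X

def dat_forward(s, X):
    """S[m][j] = (1/N) sum_n s_hat[n] X[m][n] exp(i 2 pi j n / N)."""
    N = len(s)
    s_hat = dft(s, -1)
    return [[v / N for v in dft([s_hat[n] * row[n] for n in range(N)], +1)]
            for row in X]

def dat_inverse(S, X):
    """Inversion: DFT each S[m] in time, weight by conj(X), sum over m, invert."""
    N = len(S[0])
    acc = [0j] * N
    for row, S_m in zip(X, S):
        S_hat = dft(S_m, -1)
        for n in range(N):
            acc[n] += row[n].conjugate() * S_hat[n]
    return [v / N for v in dft(acc, +1)]

N, fs = 64, 16000.0
square = [1.0 if math.sin(2 * math.pi * 500.0 * j / fs) >= 0 else -1.0
          for j in range(N)]
X = standin_kernel(N)
S = dat_forward(square, X)
rec = dat_inverse(S, X)

energy_s = sum(v * v for v in square)                       # signal energy
energy_S = sum(abs(v) ** 2 for S_m in S for v in S_m)       # DAT energy
max_err = max(abs(a - b) for a, b in zip(rec, square))      # inversion error
```

The direct $O(N^2)$ sums keep the sketch dependency-free; an FFT would replace `dft` in practice.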



4 DAT and DFT in Signal Denoising

DAT and DFT were used to denoise speech signals. Both were numerically
implemented with FFT. A simple thresholding method in the transformed
domain was applied to improve the signal-to-noise ratio (SNR) of noisy speech.
The underlying assumption of the method is that low level components in
the transformed domain are more likely to be noise than signal plus noise.
Thresholding, therefore, could improve the overall SNR of the signal. It is
a simple denoising method. More advanced methods exist for noise reduction
and will be studied in the future. The simple thresholding method serves as a
tool here to reveal the difference between DAT and DFT in signal processing.

Voiced and unvoiced speech segments were selected from a male and a
female speaker respectively. Each segment has 512 data points. Noisy speech
was created by adding Gaussian noise to the selected segments. The level of
noise was set to produce SNRs ranging from -12 dB to +12 dB with a 3
dB step size. DAT and DFT were applied to the noisy speech signals. The
magnitudes of the transformed components were then compared to a threshold.
All components with magnitude smaller than the threshold were ignored in
the reconstruction of the signal. Here, the threshold was computed as the
average of the DFT magnitude spectrum. The signal was reconstructed directly
by the inverse DAT and DFT, respectively. The SNRs of the reconstructed
signals were computed and are shown in Figure 4. Samples of the original signal, its
noise added signal (SNR = dB), and the signal denoised by thresholding
are shown for a voiced (Figure 5) and an unvoiced (Figure 6) speech segment.
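
The noise mixing, thresholding and SNR measurement described above can be sketched as follows for the DFT branch. This is a minimal illustration under stated assumptions, not the experimental code: the clean signal is a synthetic two-tone stand-in for a speech frame, the frame length of 128 and the random seed are arbitrary, and all function names are hypothetical. The DAT branch would apply the same rule to the components $S_{j,m}$; how the threshold is scaled in that domain is not specified in the text and is left out.

```python
import cmath
import math
import random

def snr_db(clean, estimate):
    """SNR in dB of `estimate` measured against the clean reference."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

def add_noise_at_snr(clean, target_db, rng):
    """Add white Gaussian noise scaled so the mixture has SNR = target_db."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum(v * v for v in noise)
    gain = math.sqrt(signal_energy / (noise_energy * 10.0 ** (target_db / 10.0)))
    return [c + gain * v for c, v in zip(clean, noise)]

def dft(x, sign=-1):
    """Unnormalized DFT sum with exponent sign * i 2 pi n k / N."""
    N = len(x)
    w = [cmath.exp(sign * 2j * math.pi * k / N) for k in range(N)]
    return [sum(x[n] * w[(n * k) % N] for n in range(N)) for k in range(N)]

def dft_threshold_denoise(noisy):
    """Zero DFT components below the mean DFT magnitude, then invert."""
    N = len(noisy)
    spec = dft(noisy, -1)
    threshold = sum(abs(v) for v in spec) / N
    kept = [v if abs(v) >= threshold else 0j for v in spec]
    return [v.real / N for v in dft(kept, +1)]

rng = random.Random(0)
fs, N = 16000.0, 128
# Synthetic stand-in for a voiced frame: two harmonically related tones.
clean = [math.sin(2 * math.pi * 250.0 * j / fs)
         + 0.5 * math.sin(2 * math.pi * 500.0 * j / fs) for j in range(N)]

results = []
for level in range(-12, 13, 3):                 # -12 dB to +12 dB, 3 dB steps
    noisy = add_noise_at_snr(clean, level, rng)
    denoised = dft_threshold_denoise(noisy)
    results.append((level, snr_db(clean, noisy), snr_db(clean, denoised)))
```

Each entry of `results` holds the target SNR, the measured SNR of the noisy frame, and the SNR after thresholding, which is the quantity plotted against input SNR in a comparison such as Figure 4.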

These results indicate that DAT thresholding has about 3 to 5 dB SNR
gain over DFT thresholding for voiced speech signals. The improvement
is larger when the noise level is relatively high. For unvoiced (noise-like)
speech signals, DAT thresholding also has an SNR gain when the noise level
is relatively high. DFT thresholding, however, has higher SNRs when the
noise level is relatively low. DAT thresholding appears to have difficulty
in discriminating between the original and added noise for unvoiced speech
segments when the added noise is relatively low. This should not be a serious
handicap, in part because it may not be as necessary to remove low-level
noise when the speech itself is noise-like (see Figure 6). Denoising is mostly
needed for voiced signals with high-level noise. The potential advantage of
DAT thresholding is demonstrated by the 3 to 5 dB improvement of
SNR when the noise level is relatively high. The noise-reduction advantage is
likely a result of the spectral spreading (weighted local spectral averaging)
operation of DAT.

5 Concluding Remarks 

Discrete auditory transforms (DAT) are introduced and shown to be invertible
and energy preserving. DAT spectra are smoother than DFT spectra, and
DAT basis functions are more localized than DFT basis functions in the time domain.
In signal denoising with a spectral thresholding method, it is observed that
DAT increased SNR by 3 to 5 dB over DFT. Further study of DAT will be
worthwhile in more complicated signal processing tasks.

Figure 1: The transform kernel $X_{m,n}$ as a function of integer $n = 1$ to 128.
The $m$ variable equals 5 (solid), 15 (dash), 25 (line-star), 35 (dashdot), 45
(line-circle), 55 (dot).

Figure 2: A 500 Hz square wave (top), its DFT spectrum (middle) and DAT
spectrum (bottom).

Figure 3: An illustration of localization property of DAT basis functions in
the time domain.

Figure 4: Comparison of DAT (solid) and DFT (dashdot) denoising by spectral
thresholding for male/female, voiced/unvoiced speech segments.

Figure 5: (Top down) voiced speech signal, noisy signal (SNR = dB),
denoised signal.

Figure 6: (Top down) unvoiced speech signal, noisy signal (SNR = dB),
denoised signal.